{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qx9ZBUUkQwvG"
      },
      "source": [
        "<img src=\"https://github.com/chdb-io/chdb/raw/pybind/docs/_static/snake-chdb.png\" width=320 >\n",
        "\n",
        "# chdb-GPT\n",
        "Generate **chDB** and **ClickHouse** queries using natural language and  **OpenAI APIs**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Gyo2KvBkMUHI"
      },
      "outputs": [],
      "source": [
        "#@title Install Requirements { display-mode: \"form\" }\n",
        "!pip install openai chdb --quiet"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BQH0cFUQMLZ1"
      },
      "outputs": [],
      "source": [
        "#@title Provide OpenAI API { display-mode: \"form\" }\n",
        "openai_api_key = \"\" #@param {type:\"string\"}\n",
        "import openai\n",
        "\n",
        "openai.api_key = openai_api_key\n",
        "\n",
        "# Set this to `True` if you need GPT4. If not, the code will use GPT-3.5.\n",
        "GPT4 = False"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-GYCp4HzQ9_X"
      },
      "outputs": [],
      "source": [
        "#@title Prepare ClickHouse GTP Agent { display-mode: \"form\" }\n",
        "class Conversation:\n",
        "    \"\"\"\n",
        "    This class helps me keep the context of a conversation. It's not\n",
        "    sophisticated at all and it simply regulates the number of messages in the\n",
        "    context window.\n",
        "\n",
        "    You could try something much more involved, like counting the number of\n",
        "    tokens and limiting. Even better: you could use the API to summarize the\n",
        "    context and reduce its length.\n",
        "\n",
        "    But this is simple enough and works well for what I need.\n",
        "    \"\"\"\n",
        "\n",
        "    messages = None\n",
        "\n",
        "    def __init__(self):\n",
        "        # Here is where you can add some personality to your assistant, or\n",
        "        # play with different prompting techniques to improve your results.\n",
        "        Conversation.messages = [\n",
        "            {\n",
        "                \"role\": \"system\",\n",
        "                \"content\": (\n",
        "                    \"You are a ClickHouse expert specializing in OLAP databases, SQL format, and functions. You can produce SQL queries using knowledge of ClickHouse's architecture, data modeling, performance optimization, query execution, and advanced analytical functions.\"\n",
        "                ),\n",
        "            }\n",
        "        ]\n",
        "\n",
        "\n",
        "    def answer(self, prompt):\n",
        "        \"\"\"\n",
        "        This is the function I use to ask questions.\n",
        "        \"\"\"\n",
        "        self._update(\"user\", prompt)\n",
        "\n",
        "        response = openai.ChatCompletion.create(\n",
        "            model=\"gpt-4-0613\" if GPT4 else \"gpt-3.5-turbo-0613\",\n",
        "            messages=Conversation.messages,\n",
        "            temperature=0,\n",
        "        )\n",
        "\n",
        "        self._update(\"assistant\", response.choices[0].message.content)\n",
        "\n",
        "        return response.choices[0].message.content\n",
        "\n",
        "    def _update(self, role, content):\n",
        "        Conversation.messages.append({\n",
        "            \"role\": role,\n",
        "            \"content\": content,\n",
        "        })\n",
        "\n",
        "        # This is a rough way to keep the context size manageable.\n",
        "        if len(Conversation.messages) > 20:\n",
        "            Conversation.messages.pop(0)\n",
        "\n",
        "\n",
        "    def build_query_prompt(query):\n",
        "\n",
        "        input_str=f\"\"\"\n",
        "        You are a ClickHouse expert specializing in OLAP databases, SQL format, and functions. You can produce SQL queries using knowledge of ClickHouse's architecture, data modeling, performance optimization, query execution, and advanced analytical functions.\n",
        "        I would like you to generate an accurate ClickHouse sql query for the question:\n",
        "        {query}\n",
        "\n",
        "        - Make sure the query is ClickHouse compatible\n",
        "        - Make sure ClickHouse SQL and ClickHouse functions are used\n",
        "        - Assume there are no tables in memory, data is always remote\n",
        "        - Load data from files using the file() ClickHouse function, for instance: file('data.csv')\n",
        "        - Load data from urls containing http using the url() ClickHouse function, for instance url('http://domain.com/file.csv')\n",
        "        - Make sure any file hosted on s3 is loaded using the s3() ClickHouse function\n",
        "        - Ensure case sensistivity\n",
        "        - Ensure NULL check\n",
        "        - Do not add any special information or comment, just return the query\n",
        "\n",
        "        The expected output is code only. Always use table name in column reference to avoid ambiguity.\n",
        "        \"\"\"\n",
        "\n",
        "        return input_str\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SkSjhVDvUZHa"
      },
      "source": [
        "Let's input our query and form a prompt:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wDXDTbooM8mE"
      },
      "outputs": [],
      "source": [
        "#@title Prompt using Natural Language { display-mode: \"form\" }\n",
        "query = \"show the top 10 towns from url https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet\"  #@param {type:\"string\"}\n",
        "\n",
        "prompt = Conversation.build_query_prompt(query)\n",
        "\n",
        "conversation = Conversation()\n",
        "\n",
        "answer = conversation.answer(prompt)\n",
        "print(answer)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aM1NvWGOUfOa"
      },
      "source": [
        "Create a new instance of `Conversation` whenever you want to clear the context."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9k5x7gvUUn5z"
      },
      "source": [
        "We can now extend our query and the API will remember what we did before."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "MY06L5jvNMNQ"
      },
      "outputs": [],
      "source": [
        "#@title Refine SQL using Natural Language\n",
        "refine_query = \"add round(avg(price)) AS price to the query\" #@param {type:\"string\"}\n",
        "answer = conversation.answer(refine_query)\n",
        "query = answer.replace(\"```sql\",\"\").replace(\"```\",\"\")\n",
        "print(query)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BKeDfL1fQ4XB"
      },
      "outputs": [],
      "source": [
        "#@title Execute Query using chDB { display-mode: \"form\" }\n",
        "import chdb\n",
        "res = chdb.query(query, 'Pretty'); print(res.data())"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
